
Initial version of aarch64 container with Vulkan #270

Open

sroecker wants to merge 1 commit into main from add_aarch64_vulkan_container
Conversation

@sroecker commented Oct 9, 2024

Initial version of an aarch64 container with Vulkan support that runs in libkrun containers on macOS.
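To try it locally once the image is built, something along these lines should work; the image tag, Containerfile path, and device node below are placeholders rather than part of this PR:

podman build -t ramalama-vulkan-aarch64 -f container-images/vulkan/Containerfile .
podman run --rm -it --device /dev/dri ramalama-vulkan-aarch64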

@ericcurtin (Collaborator)

I think we should merge this... But we also have a Vulkan image based on Kompute... It's all about naming... Please work with @slp and @rhatdan to agree on names...

@ericcurtin (Collaborator)

@sroecker please sign your commit; it is failing the DCO check. Run

git commit --amend -s

to sign the existing commit.
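If the commit has already been pushed, force-push the branch afterwards, for example (assuming the remote is named origin):

git push --force-with-lease origin add_aarch64_vulkan_container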

@sroecker force-pushed the add_aarch64_vulkan_container branch from 40672af to 3e41c4e on October 9, 2024 at 16:00
@rhatdan (Member) commented Oct 9, 2024

I would prefer these all to be based on a common base image containing all of the Python tools required to run ramalama, so that the ROCm, Vulkan, ... images can all share the lower layer.
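A rough sketch of what that layering could look like; the image names, package names, and paths below are placeholders, not the actual ramalama images:

# Containerfile.base -- hypothetical shared layer with the Python tooling
FROM quay.io/fedora/fedora:40
RUN dnf install -y python3 python3-pip && dnf clean all
COPY . /src
RUN pip3 install /src

# Containerfile.vulkan -- hypothetical accelerator layer on top of the shared base
FROM localhost/ramalama-base:latest
RUN dnf install -y mesa-vulkan-drivers vulkan-loader vulkan-tools && dnf clean all

The ROCm image would follow the same pattern, swapping the Vulkan packages for ROCm ones.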

@slp (Collaborator) commented Oct 9, 2024

@sroecker is llama.cpp working properly for you with a container built from this Containerfile? Which models have you tested?

I'm asking because the Vulkan backend hasn't worked for me since March, which is why I started favoring the Kompute backend (which also uses Vulkan).

@sroecker (Author) commented Oct 9, 2024

> @sroecker is llama.cpp working properly for you with a container built from this Containerfile? Which models have you tested?
>
> I'm asking because the Vulkan backend hasn't worked for me since March, which is why I started favoring the Kompute backend (which also uses Vulkan).

I had to test a smaller model due to machine constraints:
https://huggingface.co/MaziyarPanahi/SmolLM-1.7B-Instruct-GGUF/blob/main/SmolLM-1.7B-Instruct.Q4_K_M.gguf
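A representative invocation; the binary name, model path, prompt, and GPU-layer count here are placeholders rather than the exact command used:

llama-cli -m SmolLM-1.7B-Instruct.Q4_K_M.gguf -ngl 99 -p "Fibonacci"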

ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Virtio-GPU Venus (Apple M1 Pro) (venus) | uma: 1 | fp16: 1 | warp size: 32
llm_load_tensors: ggml ctx size =    0.20 MiB
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors: Virtio-GPU Venus (Apple M1 Pro) buffer size =  1005.01 MiB
llm_load_tensors:        CPU buffer size =    78.75 MiB
...................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: Virtio-GPU Venus (Apple M1 Pro) KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model: Vulkan_Host  output buffer size =     0.19 MiB
llama_new_context_with_model: Virtio-GPU Venus (Apple M1 Pro) compute buffer size =   148.00 MiB
llama_new_context_with_model: Vulkan_Host compute buffer size =     8.01 MiB
llama_new_context_with_model: graph nodes  = 774
llama_new_context_with_model: graph splits = 2
llama_init_from_gpt_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 5

system_info: n_threads = 5 (n_threads_batch = 5) / 5 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 1 | SVE = 0 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

sampler seed: 3671259997
sampler params:
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 2048, n_predict = -1, n_keep = 0

Fibonacci: The Fibonacci sequence is a classic example of a recurrent sequence that can be used to model various phenomena, such as the growth of populations or the stock market.

In summary, recurrences are essential for modeling dynamic systems and capturing the underlying patterns and behaviors of these systems over time. [end of text]


llama_perf_sampler_print:    sampling time =       1.97 ms /    63 runs   (    0.03 ms per token, 32061.07 tokens per second)
llama_perf_context_print:        load time =    1363.18 ms
llama_perf_context_print: prompt eval time =     614.52 ms /     4 tokens (  153.63 ms per token,     6.51 tokens per second)
llama_perf_context_print:        eval time =    1134.46 ms /    58 runs   (   19.56 ms per token,    51.13 tokens per second)
llama_perf_context_print:       total time =    1755.12 ms /    62 tokens

I can check with the Kompute backend tomorrow.

@slp (Collaborator) commented Oct 9, 2024

Tested with Mistral-7B and Wizard-Vicuna-13B and got random answers with both of them. Sadly, the Vulkan backend is still broken for Apple Silicon GPUs upstream.

I think we're going to need to stick with the Kompute backend for a while, as implemented in #235.
